Abstract: In this paper, we present a distributed infrastructure based on wireless sensor
networks and Grid computing technology for air pollution monitoring and mining. It
aims to develop low-cost, ubiquitous sensor networks that collect real-time, large-scale
and comprehensive environmental data on road traffic emissions for air pollution
monitoring in urban environments. We discuss the main informatics challenges of
constructing the high-throughput sensor Grid and present a two-layer network framework,
a P2P e-Science Grid architecture, and a distributed data mining algorithm as solutions
to these challenges. We simulated the system in
TinyOS to examine the operation of each sensor as well as the networking performance.
We also present the distributed data mining results to examine the effectiveness of the
algorithm.

Keywords: urban air pollution, sensor network, grid, distributed data mining. 



1. Introduction 

Transport has a significant impact upon the environment in which we live. In general, these impacts
can be divided under four broad headings: local air quality, climate change, noise and watercourse
pollution [1]. Of these, clean air is vital to human health. High levels of fine particulate (PM10) air
pollution in 2005 were estimated to have caused 1,031 accelerated deaths and 1,088 respiratory
hospital admissions in London [2]. The Mayor's Air Quality Strategy [3] was published in 2002 to deal
with local air quality and its impact on health, and The Control of Dust and Emissions from
Construction and Demolition Guidance [4] was issued in 2006. These documents are used to manage
the complex issues of air pollution in London and to develop the London Olympic facilities for 2012
with minimal impact on London's environment.

However, the levels of particulates and oxides of nitrogen in London are still higher than the
limits set in the Air Quality Standards [5]. The major source of air pollution in London is
road traffic emissions. The Environment Agency estimates that traffic sources account for over 97% of
CO and 75% of NOx emissions. Other notable contributions come from industrial plant and premises,
domestic energy production, and construction activity.

In order to monitor the pollutants and analyze their effects on the environment, we developed Mobile
Discovery Net (MoDisNet for short) to collect real-time pollution data on key aspects of traffic
conditions, emissions, ambient pollutant concentration and human exposure. The purpose is to develop
the capability to measure, model and predict a wide range of environmental pollutants and hazards
using a grid of pervasive roadside and vehicle/person-mounted sensors.

Developing a sensor network over a target region faces many challenges. These include
developing and extending existing e-Science Grid, sensor unit, communication and modeling
technologies to integrate data from heterogeneous fixed and mobile environmental
sensor grids in real time, so as to provide dynamic estimates of pollutant and hazard concentrations,
and demonstrating how these estimates can be usefully correlated with a wide range of other complementary dynamic
data, such as weather, health or traffic data.

In the remainder of this paper, we first describe the motivations for the development of the MoDisNet
system as well as the main contributions of this paper. Then we discuss the novel techniques we
provide to address the problems that arise when a sensor grid is constructed in a mobile,
high-throughput, real-time data environment. We also present the system architecture designed to meet the demands of
the project, as well as the sensor unit itself. This is followed by the simulation platform design,
the networking performance simulation and the real-time pollution data analysis scenarios. We
conclude the paper with a summary of the research and a discussion of future work.

2. Motivations and Contributions 

Road traffic makes a significant contribution to the emissions of the following pollutants:
benzene (C6H6), 1,3-butadiene, carbon monoxide (CO), lead, nitrogen dioxide (NO2), ozone (O3),
particulate matter (PM10 and PM2.5) and sulphur dioxide (SO2). The impact of local air quality pollutants
on the environment and health has been studied and well documented [6]. We summarize the
interaction chain linking population, traffic, air quality and health in Figure 1.

The figure shows that increased car ownership and use in urban areas (road traffic) generate
chemical emissions that form air pollution. Under various weather conditions (affected by
temperature, wind, humidity, pressure, etc.), these pollutants produce different levels of air quality. When
people are exposed to the polluted air (especially in urban areas), for example when driving in heavy traffic, near
highways or at 'downwind' locations, they may, depending on the dose-response relationship, suffer breathing
problems and asthma attacks, which will contribute to the risk of heart attacks among people with heart
disease.

Figure 1. The adverse health impacts chain. 
Under the current UK Environment Act [7], most local authorities have air quality monitoring
stations that provide environmental information to the public daily via the Internet. To date, the development of
work in these areas has been hampered by critical data gaps and asymmetries in data coverage, as well
as by the lack of the on-line data processing capability that e-Science offers. Information on a number of
key factors such as individual driver/vehicle activity, pollution concentration and individual human 
exposure has traditionally either simply not been available or only available at high levels of spatial 
and temporal aggregation, which average out scientifically critical local variations. For example, the 
conventional approach to assessing pollution concentration levels is based on data collected from a 
network of permanent air quality monitoring stations. 

However, permanent monitoring stations are frequently situated so as to measure ambient
background concentrations or at potential 'hotspot' locations and are usually several kilometers apart.
From our earlier research in the 'Discovery Net EPSRC e-Science Pilot Project' [8] (which used data
generated from statically located urban pollution monitoring sensors), we learnt that the pollution
levels and the hot spots change with time, as shown in Figure 2. However, those results were all
computed offline and cannot provide real-time tracking. As a result, they cannot give prompt feedback or
guidance to individuals and air pollution monitoring systems.

Figure 2. Pollution levels change at East London during a day. 
In addition, while traffic monitoring systems provide information on aggregate traffic parameters, they
do not provide information on individual vehicle trajectories, in particular on key features (for emissions estimation)
such as the incidence of acceleration, deceleration and idling episodes. Moreover, epidemiological
studies typically base estimates of exposure on home post code, despite the fact that many people
clearly spend large parts of the day in other locations. These data gaps have led to a number of critical
barriers to progress on key research issues. These include:

1. Model validation: Conventional environmental data sources do not provide sufficient
temporal or spatial resolution to enable existing or emerging traffic, vehicle emissions and
pollution dispersion models to be validated at the micro-scale, especially at street level. This is
inhibiting the development of the necessary understanding of how best to design local traffic
management and urban design interventions to reduce pollution concentrations in critical areas,
including pollution 'hot spots'.

2. Human exposure: Little or no data are available at the disaggregate level on individual exposure to 
pollutants, which is similarly inhibiting the development and validation of exposure modeling. 
This has important implications, not only scientifically but also practically, especially in terms of 
demonstrating compliance with existing and future regulatory obligations. 

3. Integrated traffic and environmental control: Existing traffic monitoring systems enable adaptive 
traffic control systems such as SCOOT [9] to draw on real time information on aggregate traffic 
parameters in order to dynamically optimize network performance to reduce delays (by adjusting 
signal timings and related measures). The extension of these techniques to the joint optimization 
of both traffic and environmental outcomes is highly desirable, but currently impossible because of 
the lack of comparable real-time pollution concentration data. 

We can address these concerns in two ways: by generating new forms of data (e.g., on exposure and
driver/vehicle activity) and by generating data at higher levels of spatial and temporal resolution than
existing sensor systems provide. Taking advantage of the low-cost mobile environmental sensor system, the
MoDisNet system will construct a Grid environment that fully integrates existing static sensor
systems and complementary data sources with the mobile environmental sensor system. This will
provide radically improved capability for the detection and monitoring of environmental pollutants and
hazardous materials.

The main contributions of this paper are as follows. First, we propose a highly effective air pollution
monitoring system that fully considers the urban background and the pollution features. This
system comprises a hierarchical network architecture formed by mobile sensors and stationary sensors,
which makes full use of roadside devices to mount the stationary sensors and of
public vehicles to carry the mobile sensors; an ultraviolet sensor unit, GUSTO, which achieves a data collection frequency of up to
1 Hz with high accuracy and low unit cost; and a sensor grid framework for
processing, integrating, and analyzing heterogeneous sensor data in both centralized and
distributed ways. Second, we provide a solution for executing real-time distributed data mining in
a sensor grid, and design a distributed P2P clustering algorithm for the MoDisNet system. Our results also
reveal a typical air pollution pattern in an urban environment, which gives a real-time track of the air
pollution variation and provides important information for environmental protection and
individual supervision.



Sensors 2008, 8 






3. Air Pollution Monitoring System Infrastructure 

The key feature of the MoDisNet system is the use of a variety of vehicle fleets, including buses, service
vehicles, taxis and commercial vehicles, as a platform for environmental sensors. In collaboration
with the static sensors fixed at the roadside, the whole system can detect the real-time air pollution
distribution in London. We describe the MoDisNet architecture in detail in the following
sections.



3.1. MoDisNet Network Architecture 



The MoDisNet system is constructed based on a novel network environment, which is designed as a 
two-layer network architecture - the mobile sub-network formed by the Mobile Sensor Nodes (MSN in 
short) and the stationary sub-network organized by the Static Sensor Nodes (SSN in short). The 
network architecture is shown in Figure 3. 



Figure 3. The network architecture of MoDisNet. 
Here, MSNs are installed in the vehicles. They sample the pollution data and perform the A/D
conversion to obtain digital signals. According to the system requirements, the MSNs may pre-process
the raw data (for example by noise reduction, local data cleaning and fusion) and then send these data
to the nearest SSN. The SSNs are in charge of receiving, updating, storing and exchanging the data.
In cooperation with the e-Science Grid architecture (discussed in detail in section 3.3), the
SSNs can carry out distributed data analysis and mining. Depending on the requirements of
the users or the server, the SSNs may send the raw air pollution data or the distributed mining results to
the central data warehouse for further processing.
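
As an illustration of the "send to the nearest SSN" step, the following minimal Python sketch picks the closest static node by great-circle distance. It is only a sketch under stated assumptions: the text does not say how "nearest" is determined, so geographic distance, known SSN coordinates, and the names haversine_m and nearest_ssn are all hypothetical choices made here for illustration.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    r = 6_371_000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_ssn(msn_pos, ssn_positions):
    """Return the id of the SSN closest to the MSN.

    msn_pos: (lat, lon) of the mobile node.
    ssn_positions: dict mapping SSN id -> (lat, lon).
    ASSUMPTION: 'nearest' means smallest geographic distance.
    """
    if not ssn_positions:
        raise ValueError("no SSN available")
    return min(
        ssn_positions,
        key=lambda sid: haversine_m(*msn_pos, *ssn_positions[sid]),
    )

# Hypothetical roadside nodes in London.
ssns = {"ssn_a": (51.5007, -0.1246), "ssn_b": (51.5155, -0.0922)}
target = nearest_ssn((51.5010, -0.1240), ssns)
```

A deployed system might instead choose the SSN with the strongest radio link; the selection rule is easy to swap because it is isolated in the key function.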

3.2. GUSTO Sensor Unit 

The sensors (including MSN and SSN) we designed within MoDisNet are GUSTO sensor units. 
GUSTO is an acronym for Generic Ultra violet Sensor Technologies and Observations. It is designed 
to quantify relative concentrations of a suite of urban air pollutants in real time. The key features of the 
GUSTO unit are: 
1. Simultaneous detection of multiple species of pollutants (SO2, NOx, O3, Benzene and others)

2. Real-time data collection and transmission (sampling frequency is approximately 1 Hz)

3. Relatively low unit cost (compared to permanent monitoring sites)

4. Robust (self-corrects for background changes for each scan)

5. Accurate over ambient concentrations (ppb levels)

GUSTO makes use of the characteristic narrow-band absorption of the gases under study (including
SO2, NO, NO2, O3, NH3, and Benzene) in the UV spectral range 200-300 nm. Retrievals are based on a
variation of the well-established Beer-Lambert Law, which describes the empirical relationship that
relates the absorption of light to the properties of the material through which the light is traveling.
Accordingly, the amount of light emerging from a sample is diminished by three physical phenomena:

1. The amount of absorbing material in its optical path (concentration) 

2. The distance the light must travel through the sample (path length) 

3. The probability that the photon of that particular wavelength will be absorbed by the material 
(absorptivity or extinction coefficient) 

This relationship can be expressed mathematically and has been covered in several previous works [8]. 
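
For reference, the textbook form of the Beer-Lambert Law combining the three factors listed above can be written as follows. This is the standard single-absorber form, not necessarily the exact variation used in the GUSTO retrieval; the symbols (I_0 for incident intensity, I for transmitted intensity, \varepsilon for absorptivity, c for concentration, \ell for path length) are notation assumed here.

```latex
% Standard Beer-Lambert relation for one absorber at wavelength \lambda
A(\lambda) \;=\; \log_{10}\frac{I_0(\lambda)}{I(\lambda)} \;=\; \varepsilon(\lambda)\, c\, \ell
\qquad\Longrightarrow\qquad
c \;=\; \frac{A(\lambda)}{\varepsilon(\lambda)\,\ell}
```

With several absorbers present, the absorbances add, which is why the concentration of each species has to be separated out from the measured spectrum.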

A schematic of the GUSTO unit for deployment within the MoDisNet program is presented in
Figure 4. When vehicles pass within the vicinity of the sensor, key pollutants (SO2, NOx, O3, etc.)
emitted by such vehicles absorb UV light at characteristic frequencies and are then detected by the GUSTO
sensors, as illustrated in Figure 4(a).

Figure 4. GUSTO sensor unit schematic representation 
Figure 4(b) shows that the sensor unit consists of four main components: (1) the
Deuterium Light Source (DLS), (2) UV optics in the form of a Multi-pass White Cell (MWC), (3) a
Spectrometer and a Linear CCD unit, and (4) the sensor control unit for data processing and
transmission. The sensor is closed path and ambient air is sampled at a frequency of around 1 Hz. The
UV light from the DLS passes through the Deuterium and the resulting spectral output is directed along
an open optical path via a set of Vertical Transfer Mirrors. The spectral output is then imaged onto the
surface of the CCD detector. The intensity values are obtained via a 14-bit ADC to produce an
atmospheric spectrum of wavelength versus intensity over the GUSTO range. The narrow absorption
features are subsequently de-convolved from the atmospheric spectrum and the resulting differentials
are used to calculate the concentration of each absorber.

3.3. e-Science Infrastructure Based on Grid Computing

3.3.1 e-Science and Grid 

Today's research depends increasingly on communication and cooperation. More and more, 
scientists need to share resources through distributed computing and databases, gaining access to 
specialized and expensive facilities by developing national and international collaborations. Often, this 
involves integrating complex data repositories, terascale computing and high performance visualization 
now available in many research areas. 

The term Enhanced Science, or e-Science, refers to 'the large scale science that will increasingly be
carried out through distributed global collaborations enabled by the Internet. Typically, a feature of 
such collaborative scientific enterprises is that they will require access to very large data collections, 
very large scale computing resources and high performance visualization back to the individual user 
scientists.' [10] 

In this description, as in many others, e-Science is closely associated with Grid computer network 
architecture that enables much of the global collaboration considered basic to e-Science [11, 12]. The 
Grid is an architecture proposed to bring all these issues together and make a reality of such a vision 
for e-Science. Ian Foster and Carl Kesselman, inventors of the Globus approach to the Grid, define the 
Grid as an enabler for Virtual Organizations: 'An infrastructure that enables flexible, secure, 
coordinated resource sharing among dynamic collections of individuals, institutions and resources.' It 
is important to recognize that resource in this context includes computational systems and data storage 
and specialized experimental facilities [10]. 

Currently, several research groups are working on e-Science architecture design and development,
including the TIME-EACM [13] project based at the University of Cambridge Computer Laboratory
and the North East Regional e-Science Centre (NEReSC) [14] based at the University of Newcastle.
Most of this research addresses the issues of real-time data query, distributed data access and
heterogeneity management.

However, MoDisNet aims to develop and deliver a making system for pervasive mobile 
environmental sensors. The work is based on developing a 'Mobile Sensor Data Grid' for processing, 
integrating, and analyzing heterogeneous sensor data. In the earlier 'Discovery Net
EPSRC e-Science Pilot Project' [8], we developed a service-based infrastructure for scientific
informatics that supports the analysis of data generated from statically located urban pollution
monitoring sensors. However, supporting a large number of mobile sensors within a dynamic
environment presents new challenges to Discovery Net. The number and nature of such sensor
networks mean that in most instances the data cannot realistically be warehoused and then analyzed 
off-line. A new paradigm is required where much of the analysis is performed within the network itself 
using pervasive computing technologies such as the Peer-to-Peer (P2P) model [15]. This is feasible as 
even rudimentary sensors will have processing capability and sensor-to-sensor protocols can be 
extended to support the dynamic real-time execution of analysis and mining algorithms. 

3.3.2 P2P-based Sensor Grid architecture in MoDisNet 

Within large-scale mobile sensor network architectures, the sensors themselves naturally form a P2P network and
communicate with each other over it. The GUSTO sensor can measure pollutants at a very
high level of accuracy and throughput at very short intervals, which means the volume of generated
and transferred data can reach gigabit magnitude per sensor each day. This raises many informatics
challenges for data processing and storage. In order to satisfy the real-time analysis requirements, the
sensors themselves must store part of the information and communicate with each other within a P2P
network. The measurements from sensors, both mobile and static, will be filtered and processed using a
set of specialized algorithmic processes before being warehoused within a repository. To balance
these real-time requirements against the data storage/communication trade-offs, the
sensors in the MoDisNet Grid are equipped with sufficient computational capabilities to participate in the
Grid environment, to feed data to the warehouse, to perform analysis tasks and
to communicate with their peers.
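
A rough back-of-the-envelope calculation shows how a single sensor can reach gigabit-scale raw data per day. The sampling rate (about 1 Hz) and the 14-bit ADC come from the sensor description in section 3.2; the number of CCD pixels per spectrum is not given in the text, so the value 2048 below is purely a hypothetical assumption, as is the function name raw_bits_per_day.

```python
SECONDS_PER_DAY = 24 * 60 * 60

SAMPLE_RATE_HZ = 1   # from the text: approximately 1 Hz
ADC_BITS = 14        # from the text: 14-bit ADC
CCD_PIXELS = 2048    # ASSUMPTION: hypothetical linear CCD length, not from the paper

def raw_bits_per_day(sample_rate_hz=SAMPLE_RATE_HZ,
                     pixels=CCD_PIXELS,
                     adc_bits=ADC_BITS):
    """Uncompressed bits produced per day if every scan stores one full spectrum."""
    return sample_rate_hz * SECONDS_PER_DAY * pixels * adc_bits

bits = raw_bits_per_day()
gigabits = bits / 1e9  # about 2.5 Gbit/day under these assumptions
```

Under these assumptions one sensor produces roughly 2.5 Gbit of raw spectra per day, which is consistent with the order of magnitude quoted above and illustrates why in-network processing is preferred over shipping every scan to the warehouse.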

Within MoDisNet, the sensor grid should be developed to support two important techniques: first,
techniques allowing the analysis of transport, weather and pollution data in real time using P2P
methods and protocols; second, new communication protocols supporting dynamic real-time
data aggregation and statistics. These will give MoDisNet the ability to support the full
range of analytical tasks, from dynamic real-time mining of sensor data to the analysis of off-line
data warehoused for historical analysis. To satisfy these demands, the MoDisNet sensor grid
architecture is designed as illustrated in Figure 5.

The GUSTO sensors (including SSNs and MSNs) connect to the MoDisNet Grid through several Sensor
Gateways (SGs) according to different wireless access protocols. The sensors are capable of collecting
air pollution data at up to 1 Hz and sending the data to the remote Grid service hop by hop
(in a multi-hop style). This capability enables the sensors to exchange their raw data locally and then carry out
data analysis and mining in a distributed way. It also offers the potential for further
data fusion and aggregation (which is beyond the scope of this paper and is discussed briefly
in section 7). The SGs are in charge of connecting the wireless sensor network to the IP backbone,
which can be either wired or wireless. The SGs can monitor the volumes of the data streams from the
sensors and perform load balancing to avoid transfer collisions, which is very useful for
improving the throughput and performance of the Grid architecture. The Grid architecture manages a warehouse,
accessible through SQL, which centrally stores and maintains all the
archived data, including derived sensor data and third-party data such as traffic data, weather
data and health data. These data provide a wealth of information for the Grid computation to
generate short-term or long-term models that relate air pollution to traffic, down to the level
of streets and buildings. Furthermore, they may guide the prediction of forthcoming
events such as traffic changes and pollution trends. By means of visual workflow tools, the architecture can also
provide real-time output to the end user, giving a full understanding of the air pollution and traffic
conditions at different locations within the monitored areas.



Figure 5. The MoDisNet sensor grid architecture. 
4. Data Mining

4.1 Data Mining Requirements within MoDisNet 

Within the MoDisNet Project, a substantial level of resource has been assigned to the development
of a set of data mining techniques specific to the needs of research into the relationships between urban
transport and the environment. In this system, we define the data mining tasks in three categories:

4.1.1 Centralized Data Mining 

The data stored within the central data warehouse will often need to be queried by the end-users in
order to find regularities in the pollution fluctuations, traffic data and other sensor information. These
queries will be easily configurable using Grid service elements within the MoDisNet environment, and
potentially published as Grid services themselves. In this scenario, the data mining is necessarily
implemented in the warehouse. With centralized data mining, we seek to identify the long-term
patterns of pollution and traffic and thereby identify expected baseline conditions and key relationships
between certain pollutants and traffic events. This will consist of relatively complex mining
processes for expected relationships and techniques to identify potentially new relationships in the
data. Due to this complexity and the need to access multiple data archives, each run of the
centralized mining may take a long time, depending on the amount of data under consideration
and the complexity of the algorithms. As our former project Discovery Net forms an ideal engineering
framework for rapidly building up the system, we will not discuss this case in detail.

4.1.2 P2P Based Distributed Data Mining 

Historical data is well suited to large-scale analysis over multiple dimensions, but for dynamic
queries over real-time sensor data streams, the data has to be taken directly from the sensors. These
data points have little value for warehousing, and real-time mining queries cannot afford a
"store and mine" model. A typical analytical task would involve statistics at a certain location
and about certain properties at that location. A sensor may not be able to offer this information on its
own, either because it has moved away from the location or because it cannot capture all the information relevant
to the query. A dynamically composed sensor network with a P2P communication model
can support such information exchange and distributed stream mining algorithms, and so provide a real-time
analytical querying model.

The P2P in-network data mining is based on the data collected and stored by each sensor in real
time. By mining these real-time data, the system can give a quick judgment on the pollution status, and
the result may reflect changes in the traffic situation. The convergence time of the distributed
mining algorithms must be as short as possible. At the same time, the results may be very simple and
not as accurate as those of the centralized mining.

4.1.3 Integrated Data Mining 

Based on the distributed data mining, we can become aware of abnormal air pollution conditions. Such a result
may trigger the execution of the centralized data mining in two ways:

• The distributed mining result may match an existing pattern obtained by the centralized mining. In this case,
we can suppose that some traffic event may have happened or will happen.

• The result may be entirely new, one that the centralized mining has never learnt before. In this case, the
centralized algorithm may be run again to compute a new model that is more complete and accurate than
the former one.
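
The two cases above can be sketched as a simple decision rule. This is only an illustrative sketch: the representation of a pattern as a vector of pollutant levels, the Euclidean distance test, the tolerance parameter and the name classify_result are all assumptions introduced here, not details given in the paper.

```python
import math

def classify_result(center, known_patterns, tolerance):
    """Decide how a distributed mining result relates to centrally mined patterns.

    center: tuple of pollutant levels produced by the distributed mining.
    known_patterns: dict mapping a pattern name to its reference vector.
    tolerance: ASSUMED maximum Euclidean distance that still counts as a match.
    Returns ("match", name) or ("retrain", None).
    """
    best_name, best_dist = None, math.inf
    for name, ref in known_patterns.items():
        d = math.dist(center, ref)  # Euclidean distance (Python 3.8+)
        if d < best_dist:
            best_name, best_dist = name, d
    if best_dist <= tolerance:
        return ("match", best_name)   # first case: known pattern recognised
    return ("retrain", None)          # second case: rerun centralized mining

# Hypothetical patterns with two pollutant dimensions.
patterns = {"rush_hour": (40.0, 80.0), "background": (5.0, 10.0)}
seen = classify_result((41.0, 79.0), patterns, tolerance=5.0)
unseen = classify_result((200.0, 300.0), patterns, tolerance=5.0)
```

An empty pattern store always yields "retrain", which corresponds to the situation before any centralized model has been built.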

In the following sections, we focus on the distributed data mining (DDM) technique. As
described in 4.1.2, this is the key data mining task in the MoDisNet system. We first
review the background and recent research on DDM, and then propose a P2P-based distributed
clustering algorithm designed for pattern recognition of urban air pollution.

4.2 Distributed Data Mining in Sensor Networks

Data mining in sensor networks faces several challenges. First, sensors are severely constrained in
resources, including battery lifetime, communication bandwidth, CPU capability and storage [16].
Second, sensor node mobility increases the complexity of sensor data collection and analysis [17, 18].
Third, sensor data arrive as time-ordered streams over the network, which makes traditional centralized
mining techniques inapplicable. In the MoDisNet system, it is essential to record not only the types of pollutants measured by
the sensors but also, since sensors may be mobile, the location of each sensor at
each measurement time. As a result, mining these data requires a multitude of analysis components
such as statistical, clustering, visualization and classification algorithms and tools. In addition, while the analysis
of the spatiotemporal variation of multiple pollutants with respect to one another can be performed directly
on the collected pollution data, the correlation with third-party data, such as weather, health
or traffic data, is more important and needs novel dynamic data access and integration techniques.

In sensor networks, the sensors collect data in time sequence, so each sensor receives a continuous stream
of incoming data. Because of the limited storage capability, storing the historical data
in each sensor is difficult, and so is storing even a summary or pattern of the historical data. At the same
time, sensor network data management may need on-line analysis results to be presented for
real-time monitoring and supervision. As a result, real-time DDM schemes are in strong
demand in such scenarios.

DDM offers an alternative approach that addresses the above problems by mining data using distributed
resources. DDM pays careful attention to distributed data, computing, communication, and human
resources in order to use them in a near-optimal fashion [19]. In recent years, there has been a great deal of research on
DDM in sensor networks. The existing work can be grouped into three categories:

1. intelligent data collection schemes to reduce data volume; 

2. optimal node organization strategies to eliminate the heterogeneity of sensor data; and

3. highly efficient mining algorithms to reduce the computing complexity of DDM. 

In the first category, an unsupervised approach to the outlier detection problem in sensor networks
is presented in [20], where kernel density estimators are used to estimate the distribution of the data
generated by the sensors. The authors of [21] studied the problem of detecting regions of interesting environmental
events, assuming that faults can occur in the equipment but are uncorrelated,
while environmental conditions are spatially correlated. In [22], the authors partition the sensor grid
into several subnets. If the density of nodes is high enough, one node in each subnet can be chosen to
give an estimate of the samples at every other node within its subnet. As a result, the collected data
volume is greatly reduced.

Among the strategies in the second category, arranging the nodes of the sensor network in a logical
hierarchy is a widely used method. In [23], a multi-dimensional clustering approach was
proposed to set up a two-layer framework, in which the sensor nodes are grouped into cliques on the
bottom layer and the data abstraction represents the top layer. Under this framework, a distributed
pattern recognition scheme is executed. A two-layer modular architecture to
adaptively perform data mining tasks in large sensor networks was also proposed in [24]. The architecture consists of a lower
layer, which performs data aggregation in a modular fashion, and an upper layer, which employs an
adaptive local learning technique to extract a prediction model from the aggregated information.

Highly efficient mining algorithms can produce the distributed mining result directly, and many
efforts have been made in this area. In [25], an in-network data mining technique to discover frequent
event patterns and their spatial-temporal properties was proposed, where each node collects
information from its neighbors. A self-stabilizing peer-to-peer indexing structure and an
efficient nearest neighbor querying method were presented in [26]. The authors of [27] worked on a distributed error minimization
algorithm for density estimation and clustering in sensor networks. A distributed approach for event
prediction was proposed in [28], where each sensor learns to make local decisions about capturing
environmental change.

In the next section, we present our distributed data mining algorithm: a P2P clustering algorithm
for air pollution pattern recognition, which aims to reduce the computing complexity and generate
real-time mining results.

4.3. Distributed Clustering Algorithm within MoDisNet 

To realize the P2P-based DDM, the algorithm has to support information exchange in a P2P style.
Here, we designed a clustering algorithm that satisfies the P2P mining demands. The clustering
problem is defined as follows: clustering is the process of grouping the data into classes or clusters so
that objects within a cluster have high similarity in comparison to one another, but are very dissimilar
to objects in other clusters [29]. Cluster analysis has been widely used in numerous applications,
including pattern recognition, data analysis, image processing and market research. We use clustering
analysis in the MoDisNet system to find the pollution patterns (or pollution clouds) in urban
environments. Distributed data processing takes place in both the MSNs and the SSNs. For the
MSNs, the main tasks are:

1. sensing the air pollution data and converting the analog signals to digital signals;

2. storing the sampled data; and

3. sending the data to the nearest SSN when the timer in the MSN expires.

And for the SSNs, the main tasks are:

1. receiving data from MSNs;

2. choosing a certain number of SSNs as the Information Exchange Node Set (IENS) by means of a
random algorithm (the random algorithm is beyond the scope of this paper), then sending
polling messages to those nodes and waiting for the data exchange messages;

3. receiving the data exchange messages from the SSNs in the IENS; and

4. executing the distributed data mining algorithm.
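
As an illustration of the MSN tasks listed above, the following Python sketch models an MSN that stores its samples and uploads them to the nearest SSN when its timer expires. It is only a sketch under stated assumptions, not the TinyOS implementation: the class names, the `send_period` parameter (a timer counted in samples) and the use of planar Euclidean distance to pick the nearest SSN are all hypothetical, and the readings are assumed to be already digitized.

```python
import math
from dataclasses import dataclass, field


@dataclass
class SSN:
    """Static sensor node; in this sketch it only records what it receives."""
    node_id: int
    position: tuple
    received: list = field(default_factory=list)

    def receive(self, records):
        self.received.extend(records)


@dataclass
class MSN:
    """Mobile sensor node: sample, store, and upload on timer expiry."""
    node_id: int
    position: tuple
    send_period: int  # hypothetical timer, counted in number of samples
    buffer: list = field(default_factory=list)

    def sample(self, reading, ssns):
        # Tasks 1 and 2: take an (already digitized) reading and store it.
        self.buffer.append((self.node_id, reading))
        # Task 3: when the timer expires, send the stored data to the nearest SSN.
        if len(self.buffer) >= self.send_period:
            nearest = min(ssns, key=lambda s: math.dist(self.position, s.position))
            nearest.receive(self.buffer)
            self.buffer = []


# Illustrative values only; they are not taken from the MoDisNet dataset.
ssns = [SSN(0, (0.0, 0.0)), SSN(1, (10.0, 0.0))]
msn = MSN(12, (8.0, 1.0), send_period=3)
for no_value in (21.0, 23.5, 22.1, 24.0):
    msn.sample({"NO": no_value}, ssns)
```

With these illustrative values, the first three readings are uploaded to the SSN at (10.0, 0.0), which is the closer of the two, and the fourth reading stays in the MSN buffer until the timer expires again.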

To realize the data processing and analysis functions described above, a distributed k-means
algorithm (the goal is to find k centers that minimize the maximum distance of a point to its closest
center [30]) is designed to mine the air pollution patterns in different locations according to the
sampled air pollutants' volumes. This algorithm runs in each SSN. To describe this algorithm, we
first explain some definitions (suppose the total number of SSNs is n (n > 0)).

• SSN_i: an SSN node with the identity i (i = 0, ..., n-1);

• S_i: an Information Exchange Node Set (IENS) of SSN_i, which is a set of some of the SSNs that can
exchange information with SSN_i;

• k: the number of clusters required in the k-means algorithm (k > 0);

• C^l_ij: the cluster center of the jth (j = 0, ..., k-1) cluster that is computed in SSN_i in the lth
recursion (l > 0);

• Num_ij: the number of members (data points) belonging to the jth cluster in SSN_i;

• δ: a pre-defined threshold.

1. Node SSN_i chooses a certain number of SSNs as S_i by means of a random algorithm;

2. Node SSN_i receives data from MSNs as its local data and chooses k local data points as the initial
local cluster centres C^l_ij (j = 0, ..., k-1);

3. Node SSN_i calculates the Euclidean distance between each local data point and each C^l_ij;

4. Node SSN_i assigns each data point to the nearest C^l_ij as a member of that cluster, so that each
local cluster of SSN_i can be described as (C^l_ij, Num_ij);

5. Node SSN_i sends polling messages at intervals to each SSN in S_i and expects to receive data
information from them;

6. Node SSN_x in S_i sends back its local data description (C^l_xj, Num_xj) to SSN_i if it receives the
polling message from SSN_i;

7. Once node SSN_i has received the data information it expects from all the nodes in S_i, SSN_i
calculates the new cluster centres C^(l+1)_ij;

8. Node SSN_i computes the offset between C^(l+1)_ij and C^l_ij; if the offset < δ, the algorithm
finishes; otherwise SSN_i replaces C^l_ij by C^(l+1)_ij and goes to step 3.
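
The eight steps can be sketched in Python as below. This is a minimal, synchronous simulation under several assumptions that the text does not fix: each node summarises cluster j by the mean of its local members together with the member count; the new centre in step 7 is the count-weighted mean of the node's own summary and those returned by its IENS; the offset in step 8 is the largest Euclidean shift of any centre, checked over all nodes at once; and the initial centres are sorted so that cluster index j is roughly comparable across nodes. The function names and the `max_rounds` guard are hypothetical.

```python
import math
import random


def local_summary(data, centres):
    """Steps 3-4: assign each local point to its nearest centre and
    return one (centroid, count) pair per cluster."""
    k, dim = len(centres), len(centres[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for x in data:
        j = min(range(k), key=lambda c: math.dist(x, centres[c]))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += x[d]
    summary = []
    for j in range(k):
        if counts[j]:
            centroid = tuple(s / counts[j] for s in sums[j])
        else:
            centroid = tuple(centres[j])  # assumption: empty cluster keeps its centre
        summary.append((centroid, counts[j]))
    return summary


def merge(summaries, old_centres):
    """Step 7 (assumed rule): count-weighted mean of the received centres."""
    merged = []
    for j, old in enumerate(old_centres):
        total = sum(s[j][1] for s in summaries)
        if total == 0:
            merged.append(tuple(old))
            continue
        merged.append(tuple(
            sum(s[j][0][d] * s[j][1] for s in summaries) / total
            for d in range(len(old))
        ))
    return merged


def distributed_kmeans(local_data, k, iens_size, delta=1e-6, max_rounds=100, seed=0):
    rng = random.Random(seed)
    n = len(local_data)
    # Step 1: every SSN_i draws its IENS S_i at random from the other SSNs.
    iens = [rng.sample([x for x in range(n) if x != i], iens_size) for i in range(n)]
    # Step 2: k local points serve as initial centres (sorted: an assumption).
    centres = [sorted(rng.sample(local_data[i], k)) for i in range(n)]
    for _ in range(max_rounds):
        summaries = [local_summary(local_data[i], centres[i]) for i in range(n)]
        # Steps 5-7: poll the IENS and merge the returned descriptions.
        new = [merge([summaries[i]] + [summaries[x] for x in iens[i]], centres[i])
               for i in range(n)]
        # Step 8: stop when no centre moved by delta or more.
        offset = max(math.dist(a, b)
                     for i in range(n) for a, b in zip(centres[i], new[i]))
        centres = new
        if offset < delta:
            break
    return centres


# Illustrative two-dimensional data for three SSNs (not real pollutant values).
data = [
    [(1.0, 1.0), (1.2, 0.8), (9.0, 9.1)],
    [(0.9, 1.1), (8.8, 9.0), (9.2, 8.9)],
    [(1.1, 0.9), (9.1, 9.2), (1.0, 1.2)],
]
centres = distributed_kmeans(data, k=2, iens_size=2)
```

Only the per-cluster summaries travel between nodes in this sketch, never the raw samples, which is the property that keeps the communication cost of the P2P style low.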

5. System Operational Simulation 

5.1. Simulation Platform

The operational simulation of the MoDisNet system aims to give an overall evaluation of the
hardware and software design. The simulation covers not only the wireless communication
processing and upper layer algorithms, but also the sensor unit itself, including the soundness of all
the function modules of the sensor chip, the lower layer driver programs, and the cooperation of
different types of sensors (MSN and SSN). For this purpose, we designed a simulation platform which
can simulate all the functions of a single sensor, as well as the information exchange between
different sensors.

To simulate the functions of a single sensor, we designed and coded MSN/SSN units based on the 
TinyOS system. TinyOS is an open-source operating system designed for wireless embedded sensor 
networks. It features a component-based architecture which enables rapid innovation and 
implementation while minimizing code size as required by the severe memory constraints inherent in 
sensor networks [31]. To realize all the functions that an MSN or SSN requires, we have to choose
suitable system components that the TinyOS system provides, as well as design appropriate
user/application components based on the functional descriptions listed in section 4.3. Besides, we
need to use different interfaces to link all the components to make them work together.

According to the functional description, we designed the MSN and SSN in the TinyOS system as shown
in Figure 6 (a) and (b). Here, the grey square stands for a component; the shadowed square is the
interface that a component provides; the dotted square is the interface that a component uses; the
line with a black arrow means a command sent from the tail of the arrow to the head of the arrow; and
the line with a white arrow means an event signalled from the tail of the arrow to the head of the arrow.

Figure 6. The Structure Framework of MSN and SSN.

The components that an MSN uses are the Main component (which is the access point of all programs);
the user/application component MSN (which executes all the functions of an MSN); the sampling timer
component TimerC; the sampling and A/D conversion component DemoSensorC; and the data
transmission and multi-hop routing component AODV. Similarly, the components that an SSN uses are
Main, SSN, TimerC (which is used for polling timing) and AODV. Here, AODV is the acronym of
Ad hoc On-Demand Distance Vector Routing [32]. This is an on-demand distance vector routing
protocol which supports multi-hop routing and is widely used in ad hoc networks.
The AODV component is provided by the TinyOS system, which helps to realize the sending and
receiving of control information and data messages easily between different sensor units.

5.2. Visualization of System Operation 

Based on the simulation platform described in section 5.1, we can visualize and monitor the system
operation in OMNeT++ [33]. The purpose of the visualization is to investigate the
initialization/configuration of the system and the performance of the routing protocols. Besides, we
need to know whether the polling and data exchange messages can be transferred correctly, and
whether the distributed data mining algorithm performs as expected.

The network topology of the simulation is shown in Figure 7. There are 18 sensor nodes,
including 12 SSN nodes from tic[0] to tic[11] and 6 MSN nodes from tic[12] to tic[17]. Data can be
sent and received in both directions along the edges. We use the air pollution data of four pollutants,
NO, NO2, SO2, and O3, at 1-minute intervals in an urban environment from 8:00 to 17:59 within a day
as the sampled data for each MSN. The dataset contains 7200 entries in total.



Figure 7. The network topology of the simulation.

The simulation is performed on a computer with a 3.2 GHz P4 CPU. The link delay is set to 100 ms.
The running details of the system can be monitored as shown in Figure 8.

Figure 8. Simulation monitor results.

(a) [Screenshot: OMNeT++/Tkenv window of the simulation run, with a log of pk-POLLING and
pk-DATAEXCHANGE packets being sent, forwarded and received between nodes.]

(b) Running time 10.7421875 seconds. tic[1] receives 50 packets; tic[2] receives 71 packets.



After the initialization, all the nodes begin to exchange the polling and data messages. We can see
from Figure 8(a) that all the messages can be sent and received between the source node and the
destination node hop by hop. Figure 8(b) shows the receiving procedure of every packet in tic[1] and
tic[2] within 10.7421875 seconds, during which tic[1] received 50 messages and tic[2] received 71
messages. These messages include all the control messages (such as initialization, topology discovery
and routing information), polling messages and data exchange messages. The number of messages
received at these nodes increases approximately linearly, which means the system can send and
receive messages with very low collision and packet loss.

5.3. Operation of the Distributed Clustering Algorithm 

The distributed clustering algorithm test results in tic[1] are shown in Figure 9. The test dataset is
the air pollution data of four pollutants: NO, NO2, SO2, and O3. Each data record has a time stamp
between 9:00 and 10:00, a sensor ID which shows the data source, and four values of the volumes of
the pollutants. Here we defined the k-means algorithm with k = 3, so there are three clusters with
the cluster IDs 0, 1 and 2. Result (a) is a part of the real-time mining result in tic[1]. We can see that,
in each recursive pass of the algorithm, SSN tic[1] computes the cluster centers and assigns the
distributed data it gathered to a corresponding cluster. In result (b), the algorithm finishes according to
the convergence criterion and tic[1] obtains a local clustering result. The final cluster centers and total
number of data points in tic[1] are also shown in the result.

Figure 9. The distributed clustering algorithm test result in tic[1].

(b) The final mining result after the algorithm converged

The comparison of the average clustering accuracy of the centralized and distributed clustering
algorithms is shown in Table 1. For the centralized clustering algorithm, we take tic[8] to be the sink
(central point for data processing) in the topology shown in Figure 7, which means every other node
sends its collected data to tic[8], and the classic k-means algorithm [34] runs in tic[8] for
centralized clustering. For the accuracy measurement, let X^i denote the dataset at node i. Let
L^i_km(x) and L^i(x) denote the labels (cluster membership) of sample x (x ∈ X^i) at node i under the
k-means algorithm and our distributed clustering algorithm, respectively. We define the Average
Percentage Membership Match (APMM) as

\[ \mathrm{APMM} = \frac{1}{n} \sum_{i=0}^{n-1} \frac{\left|\{x \in X^i : L^i_{km}(x) = L^i(x)\}\right|}{\left|X^i\right|} \times 100\% \]

where n is the total number of SSNs.
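
To make the measure concrete, the short Python sketch below computes the APMM from two sets of per-node label lists. It assumes that the cluster IDs produced by the centralized and the distributed algorithm have already been aligned (the same ID denotes the same cluster) and that both lists keep the samples of each node in the same order; the function name and the example labels are hypothetical.

```python
def apmm(central_labels, distributed_labels):
    """Average Percentage Membership Match over all SSNs.

    Each argument holds one list per SSN; list i contains the cluster
    labels of the samples in X^i, in the same sample order.
    """
    n = len(central_labels)
    total = 0.0
    for l_km, l_dist in zip(central_labels, distributed_labels):
        # Fraction of samples at this node whose two labels agree.
        matches = sum(1 for a, b in zip(l_km, l_dist) if a == b)
        total += matches / len(l_km)
    return total / n * 100.0


# Two nodes with four samples each; one label differs at the first node.
central = [[0, 0, 1, 2], [1, 1, 2, 0]]
distributed = [[0, 0, 1, 1], [1, 1, 2, 0]]
score = apmm(central, distributed)  # (3/4 + 4/4) / 2 * 100 = 87.5
```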

For the distributed clustering algorithm, we vary the number of nodes in the Information Exchange
Node Set (IENS) of each SSN from 1 to 10. Data are randomly assigned to each SSN. Table 1 shows
the APMM results.

Table 1. Centralized Clustering vs. Distributed Clustering (APMM results).

From Table 1 we can see that, when the number of nodes in the IENS is no less than 2, in other words,
when each SSN exchanges data with at least two other SSNs, the APMM exceeds 90%. When the
number of nodes in the IENS is no less than 4, the APMM exceeds 93%. The results are achieved under
the condition of assigning the data to each SSN randomly. In reality, if the patterns of the dataset
vary across locations, the APMM may be lower than the results in Table 1. In such situations, a
good scheme for choosing the nodes to construct the IENS would be very important.

6. Data Analysis Scenario 

In this section, we present a real-time pollution data analysis scenario to evaluate the data analysis
capability of the MoDisNet system. This evaluation is based on the air pollution data generated
by our former project Discovery Net, which constructed a sensor grid over a typical urban area, as
shown in the map of Figure 10, around the Tower Hamlets and Bromley areas in east London. There are
some typical landmarks such as the main road extending from A6 to L10, the hospitals around
C5 and K4, the schools at B7, C8, D6, F10, G2, H8, K8 and L3, the train stations at D7 and L5 and the
Gas Works between D2 and E1. 140 sensors are distributed in this area and collect data from 8:00 to
18:00 at 1-minute intervals to monitor the pollutants NO, NO2, SO2 and O3.

Figure 10. MoDisNet sensors evaluation case in an area of east London.

The air pollution analysis uses the air pollution data to give an overall understanding of the air
pollution characterization within this area by running the data mining algorithm. Discovery Net
can only classify the pollution data into several pollution levels, such as high or low, but cannot tell
us the distribution of different pollutants in different locations and their contributions to the pollution
levels. To improve the data analysis capability, in this data analysis scenario, we use the distributed
clustering algorithm to cluster the pollutants into pollution clouds which can recognize different
pollution patterns. From the experimental results of Discovery Net, we pick up all the high pollution
level locations at 9:00, 15:30 and 17:00 respectively to check the contribution of different pollutants
(NO, NO2, SO2 and O3) to the pollution levels. The results are shown in Figure 11.

Figure 11. Pattern recognition for high air pollution level areas.

From the figures we can see that in the morning at 9:00, the high pollution locations are distributed
around the main roads and the schools (highlighted by the circles). At 15:30, the high pollution
locations are around the schools and the factory (which is a gasworks, highlighted by the square). At
17:00, more pollution is concentrated on the main roads; the hospitals (highlighted by the ellipses)
and the factory contribute as well. Checking the mining data set, we found the relationship between
the pollution clouds, the different pollutants and the pollutants' volumes, which is shown in Table 2.



Table 2. Pollution pattern analysis.

This table shows that, in each time snapshot, there are 3 different pollution clouds: Red, Blue and
Yellow. Each cloud represents a particular combination of the 4 pollutants. For example, in the morning
at 9:00, the Red cloud that covers the main roads and the area around them is characterized
by high volumes of NO and SO2. Most of these pollutants are emitted by the vehicles running in the
morning rush hours. According to the environmental reports, the road transport sector is today the
major source of NOx emissions in the UK, especially in urban areas, contributing more than 50% to the
total emission. Besides, solid fuel and petroleum products are two main contributors of SO2.
Hence, reducing transport emissions and producing cleaner diesel fuel remain urgent research
topics.

At 15:30 in the afternoon, the Blue clouds, which have high volumes of NO2 and O3 with a low volume
of NO, cover the school areas, while the Yellow cloud, featuring a high volume of SO2, covers the
factory area. As NO2 and O3 are both formed through a series of photochemical reactions involving
NO, CO, hydrocarbons and PM, generating NO2 and O3 takes a period of time. That is the
reason that the density of NO2 and O3 in the afternoon is higher than that in the morning traffic peak
time.

At 17:00 in the afternoon, the pollution distribution appears to be at its worst within the day. Besides
the transport emission around the roads and the factory emission, some other locations such as the
hospitals contribute some kinds of pollutants, including sulphides and nitrides. On the far right of
Figure 11, a yellow cloud covering a main road from M1 to L10 contains a high density of SO2, NO2 and
O3. The pollution pattern is very similar to the pattern at the factory and hospital areas, but not similar
to the pattern on the other main road (from A6 to K10). We investigated this area and found that a
brook flows along it to the near east and a factory area is located on the opposite side of the brook,
beyond the scope of this map. This can explain why the pollution patterns are different on
these two main roads.

7. Conclusions 

In this paper, we have provided an overview of the urban air pollution analysis within the MoDisNet
project, describing the network framework, the GUSTO sensor technology, the mobile sensor grid
architecture and the distributed data mining algorithm. The system can achieve high performance
based on the high quality mobile sensing capability of the GUSTO sensor unit, which can measure
pollutants at very short intervals. Besides, given the distribution of the sensors across the
large area of urban environments and the data integration requirements during transmission and
analysis, the well-designed e-Science Grid architecture and distributed data mining algorithm are
essential for this scenario.

We are currently researching data fusion and aggregation techniques [35, 36, 37] to improve
the system performance when large amounts of data are collected and transferred, especially when
third-party data (including traffic data and weather data) are imported. As discussed in section 3.3.2,
the Grid architecture and multi-hop routing capability enable the MoDisNet system to implement
data fusion and aggregation. This can save communication cost, reduce the power consumption of the
sensor units, and increase the available bandwidth of the wireless channels. One approach is to design
the fusion and aggregation algorithms with the integration of novel routing protocols; another is to
analyze the pollution data collected in different locations to find the redundancy of the data, so as to
decrease the number of active sensors or prolong the sample intervals. Besides, when the mobility
pattern of the vehicles is taken into consideration, the data communication and analysis may face new
challenges, such as noise reduction, positioning, and mobile data collection and identification.

As the UK has obligations as a member of the United Nations Economic Commission for Europe
(UNECE) regarding transboundary pollutants that cause harm to the environment, reducing the impact
of road traffic on the environment is vital to the government and individuals. Of all emissions
contributed mainly by road transport, monitoring of PM10 and NOx is currently most desired, with
expectations that finer detection (e.g. PM2.5) will be needed in the future and also a need for more
detailed monitoring of noise levels. As addressing global warming becomes more important in
government policy, however, local authorities are likely to be increasingly required to monitor and
reduce greenhouse gas emissions in their regions. Information on greenhouse gases is therefore also
needed for long term monitoring purposes, with similar linkages to traffic and weather data to
understand the contribution of traffic to environmental conditions.